Objective
The objective is to build a regression model that predicts house price from several housing variables.
Data Processing
Area and Prices are quantitative variables measured in square feet and dollars, respectively. Garage, FirePlace, and Baths refer to the count of those items in a given house. City is a qualitative variable indicating one of three cities. All remaining variables indicate the presence (1) or absence (0) of that feature.
| | variable | mean | min | median | max | nunique | nmissing | eltype |
|---|---|---|---|---|---|---|---|---|
| 1 | :Area | 124.93 | 1 | 125.0 | 249 | nothing | nothing | Int64 |
| 2 | :Garage | 2.00129 | 1 | 2.0 | 3 | nothing | nothing | Int64 |
| 3 | :FirePlace | 2.0034 | 0 | 2.0 | 4 | nothing | nothing | Int64 |
| 4 | :Baths | 2.99807 | 1 | 3.0 | 5 | nothing | nothing | Int64 |
| 5 | :WhiteMarble | 0.332992 | 0 | 0.0 | 1 | nothing | nothing | Int64 |
| 6 | :BlackMarble | 0.33269 | 0 | 0.0 | 1 | nothing | nothing | Int64 |
| 7 | :IndianMarble | 0.334318 | 0 | 0.0 | 1 | nothing | nothing | Int64 |
| 8 | :Floors | 0.499386 | 0 | 0.0 | 1 | nothing | nothing | Int64 |
| 9 | :City | 2.00094 | 1 | 2.0 | 3 | nothing | nothing | Int64 |
| 10 | :Solar | 0.498694 | 0 | 0.0 | 1 | nothing | nothing | Int64 |
| 11 | :Electric | 0.50065 | 0 | 1.0 | 1 | nothing | nothing | Int64 |
| 12 | :Fiber | 0.500468 | 0 | 1.0 | 1 | nothing | nothing | Int64 |
| 13 | :GlassDoors | 0.49987 | 0 | 0.0 | 1 | nothing | nothing | Int64 |
| 14 | :SwimingPool | 0.500436 | 0 | 1.0 | 1 | nothing | nothing | Int64 |
| 15 | :Garden | 0.501646 | 0 | 1.0 | 1 | nothing | nothing | Int64 |
| 16 | :Prices | 42050.1 | 7725 | 41850.0 | 77975 | nothing | nothing | Int64 |
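The three marble indicators are mutually exclusive one-hot columns, so they can be collapsed into a single label by keying a Dict on the concatenated bits, which is how the exploration code below handles them. A minimal self-contained sketch in plain Base Julia (the helper `marble_type` is illustrative, not part of the notebook):

```julia
# Collapse mutually exclusive one-hot marble columns into one label.
# Keys concatenate the White, Black, and Indian indicators in that order.
marble_lookup = Dict("100" => "White", "010" => "Black", "001" => "Indian")

function marble_type(white::Int, black::Int, indian::Int)
    marble_lookup[string(white) * string(black) * string(indian)]
end

marble_type(1, 0, 0)  # "White"
marble_type(0, 0, 1)  # "Indian"
```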
Data Exploration
The data was manipulated to investigate three things:
- The distribution of house price
- How area affects the average house price
- Which variables contribute to house price
```julia
begin
    # Dicts for recoding variables
    D1 = Dict("001" => "Indian", "010" => "Black", "100" => "White")
    D2 = Dict(0 => "No", 1 => "Yes")

    # Manipulate data for data exploration
    # (Query.jl macro names reconstructed; they were stripped in the export)
    houseprice_ex = houseprice |>
        @mutate(MarbleType = string(_.WhiteMarble) * string(_.BlackMarble) * string(_.IndianMarble)) |>
        @mutate(MarbleType = D1[_.MarbleType]) |>
        @mutate(Floors = D2[_.Floors]) |>
        @mutate(Solar = D2[_.Solar]) |>
        @mutate(Electric = D2[_.Electric]) |>
        @mutate(Fiber = D2[_.Fiber]) |>
        @mutate(GlassDoors = D2[_.GlassDoors]) |>
        @mutate(SwimingPool = D2[_.SwimingPool]) |>
        @mutate(Garden = D2[_.Garden]) |>
        @select(1:4, 17, 8:16) |>
        DataFrame

    # Creates an expression that aggregates a variable and plots it
    # against average price
    function aggprice(v::Symbol, x::String)
        :(
            houseprice_ex |>
            @groupby(_.$v) |>
            @map({$v = key(_), Avg_Price = mean(_.Prices)}) |>
            @orderby(_.$v) |>
            @vlplot(
                mark = :bar,
                x = $x,
                y = {"Avg_Price:q", title = "Average Price"}
            )
        )
    end

    # Plots of each variable vs average price
    Garage      = eval(aggprice(:Garage, "Garage:n"))            # Garage y
    FirePlace   = eval(aggprice(:FirePlace, "FirePlace:n"))      # FirePlace y
    Baths       = eval(aggprice(:Baths, "Baths:n"))              # Baths y
    MarbleType  = eval(aggprice(:MarbleType, "MarbleType:n"))    # MarbleType yy
    Floors      = eval(aggprice(:Floors, "Floors:n"))            # Floors yy
    City        = eval(aggprice(:City, "City:n"))                # City y
    Solar       = eval(aggprice(:Solar, "Solar:n"))              # Solar n
    Electric    = eval(aggprice(:Electric, "Electric:n"))        # Electric 0.5y
    Fiber       = eval(aggprice(:Fiber, "Fiber:n"))              # Fiber yy
    GlassDoors  = eval(aggprice(:GlassDoors, "GlassDoors:n"))    # GlassDoors 0.5y
    SwimingPool = eval(aggprice(:SwimingPool, "SwimingPool:n"))  # SwimingPool n
    Garden      = eval(aggprice(:Garden, "Garden:n"))            # Garden n

    # Average house price by area range
    ahp = houseprice_ex |>
        @take(10000) |>
        @vlplot(
            mark = {:bar},
            x = {"Area:n", bin = {maxbins = 20}, title = "Area Range"},
            y = {"average(Prices)", scale = {zero = false}, title = "Average House Price"},
            height = 200, width = 400,
            title = "Average House Price by Area Range"
        )

    # House price density plot
    houseprice_dp = houseprice_ex |>
        @take(10000) |>
        @vlplot(
            mark = :area,                        # area plot
            transform = [{density = "Prices"}],  # for counts, add counts = true
            x = {"value:q", title = "House Price"},
            y = {"density:q"},                   # density variable created by the transform
            width = 400,
            title = "House Price Density Plot"
        )
end;
```

The distribution of house price is roughly bell shaped, with a high concentration of prices in the 35,000 to 40,000 range.
```julia
houseprice_dp
```

Average house prices generally increase with area range; however, that trend breaks in the 80–120 sq ft range as well as the 180–200 sq ft range.
```julia
ahp
```

Having white marble and a fiber connection tends to be a strong predictor of price. Variables like Garage, FirePlace, Baths, and Floors, which increase with Area, also tend to predict higher prices. Non-predictors of price appear to include Solar, SwimingPool, and Garden.
```julia
begin
    @vlplot(title = "Variable Impacts on House Prices") +
    [Garage FirePlace Baths MarbleType;
     Floors City Electric Fiber GlassDoors;
     Solar SwimingPool Garden]
end
```

Model Building
The data was split into a training and a testing set (70/30). The modeling pipeline converts the input variables to a continuous type, then fits an EvoTreeRegressor to predict house prices, using a max_depth of 8.
```julia
begin
    # Drop non-predictive variables and convert Prices to Float64
    houseprice_mod = houseprice |>
        @select(-:Solar, -:SwimingPool, -:Garden) |>
        @mutate(Prices = float(_.Prices)) |>
        DataFrame

    # Select X and y for modeling
    X = houseprice_mod |> @select(-:Prices) |> DataFrame
    y = houseprice_mod.Prices

    # Indices for the model (training) and validation (testing) sets
    m, v = partition(eachindex(y), 0.7, shuffle = true)

    # Model/training sets
    Xm = X[m, :]
    ym = y[m]

    # Validation/testing sets
    Xv = X[v, :]
    yv = y[v]

    # Load the EvoTreeRegressor model
    EvoTreeRegressor = @load EvoTreeRegressor pkg = "EvoTrees"

    # Pipeline that coerces Count inputs to Continuous and fits an EvoTreeRegressor
    pipe = @pipeline(
        X -> coerce(X, Count => Continuous),
        EvoTreeRegressor(max_depth = 8),
        prediction_type = :deterministic
    )

    # Fit machine
    mach = machine(pipe, Xm, ym)
    fit!(mach)
end;
```

```
Pipeline527(
    evo_tree_regressor = EvoTreeRegressor(
        loss = EvoTrees.Linear(),
        nrounds = 10,
        λ = 0.0f0,
        γ = 0.0f0,
        η = 0.1f0,
        max_depth = 8,
        min_weight = 1.0f0,
        rowsample = 1.0f0,
        colsample = 1.0f0,
        nbins = 64,
        α = 0.5f0,
        metric = :mse,
        seed = 444)) @883
```

```julia
pipe
```

Cross Validation
The number of rounds of training (nrounds) was plotted on a learning curve. After about 120 rounds there are diminishing returns in rms error, so the original model will be retrained with nrounds=128 rather than nrounds=10. A sufficient number of rounds is needed to bring the rms error below 300, as indicated by the chart below.
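The `curve` object plotted below is not defined in this excerpt; it was presumably produced with MLJ's `learning_curve`. A sketch of how that might look, where the range bounds and the resampling choice are assumptions:

```julia
# Sweep nrounds over a log-spaced range and record out-of-sample rms.
# `pipe` and `mach` are the pipeline and machine fitted above.
r = range(pipe, :(evo_tree_regressor.nrounds), lower = 10, upper = 500, scale = :log10)

curve = learning_curve(mach,
                       range = r,
                       resampling = Holdout(fraction_train = 0.7, shuffle = true),
                       measure = rms)

# curve.parameter_values and curve.measurements are what plot_lc would draw.
```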
```julia
plot_lc(curve, "rms")
```

Model Evaluation

Evaluation was done using the rms and mae metrics with a 70% shuffled holdout resampling. A reasonable error was achieved for both metrics.
| | measure | value |
|---|---|---|
| 1 | rms | 229.689 |
| 2 | mae | 184.437 |
```julia
DataFrame(measure = ev[1], value = ev[2])
```

An rms error of 231.0 was also achieved on the test set.
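The `ev` object above is not defined in this excerpt; with MLJ it would typically come from `evaluate!`, and the quoted test-set rms from predicting on the held-out rows. A sketch under those assumptions, reusing `mach`, `Xv`, and `yv` from the model-building cell:

```julia
# Holdout evaluation matching the table above (70% shuffled train fraction),
# reporting both rms and mae.
ev = evaluate!(mach,
               resampling = Holdout(fraction_train = 0.7, shuffle = true),
               measures = [rms, mae])

# Test-set error: predict on the held-out validation rows and compare.
ŷ = predict(mach, Xv)
rms(ŷ, yv)
```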
Variable Importance
The chart below ranks variables by their contribution to the model under the Shapley framework. Interestingly, Floors, Fiber, and WhiteMarble are the best predictors of house price.
```julia
begin
    # Take a data_shap DataFrame and return a better-formatted one
    function better_shap(ds::DataFrame)
        DataFrame(
            feature_name = ds.feature_name,
            feature_value = float.(identity.(ds.feature_value)),
            shap_effect = Float64.(ds.shap_effect)
        )
    end

    # Label a correlation as positive or negative
    function polarity(r::Real)
        r >= 0 ? "positive" : "negative"
    end

    # Plot Shapley variable importance
    function shap_plot(data_shap::DataFrame)
        better_shap(data_shap) |>
        @mutate(abs_shap_effect = abs(_.shap_effect)) |>
        @groupby(_.feature_name) |>
        @map({variable = key(_),
              average_abs_shap_effect = mean(_.abs_shap_effect),
              correlation = corkendall(_.shap_effect, _.feature_value)}) |>
        @mutate(impact = polarity(_.correlation)) |>
        @vlplot(
            mark = :bar,
            y = {"variable:n", sort = "-x", title = "feature"},
            x = {"average_abs_shap_effect:q", title = "mean |shap effect|"},
            color = "impact:n",
            title = "Shapley Variable Importance"
        )
    end

    # Sample of observations to explain
    explain = Xm |> @take(5000) |> DataFrame

    # Prediction function returning a DataFrame, as required by ShapML
    function predict_function(model, data)
        DataFrame(y_pred = predict(model, data))
    end

    # Compute stochastic Shapley values
    data_shap = ShapML.shap(explain = explain,
                            model = mach,
                            predict_function = predict_function,
                            sample_size = 60,
                            seed = 1)

    # Plot variable importance
    shap_plot(data_shap)
end
```

This chart shows how the important variables correlate with house price. For example, having Indian marble correlates negatively with house price, which is consistent with the Shapley variable-importance plot. For the other variables, having a feature, or having more of it, correlates with a higher house price.